ABSTRACT

The Cell Broadband Engine (Cell/BE) processor has a unique memory access architecture in addition to its powerful computing engines. Many compute-intensive applications have been ported to Cell/BE successfully, but memory-intensive applications have rarely been investigated beyond a few micro-benchmarks. Since Cell/BE has a powerful software-visible DMA engine, this paper studies whether Cell/BE is suited to applications with large numbers of random memory accesses. Two benchmarks, GUPS and SSCA#2, are used. The latter is a rather complex one that is representative of real-world graph analysis applications. We find that both benchmarks perform well on the Cell/BE-based IBM QS20/22. Compared with two conventional multi-processor systems with the same core/thread count, GUPS is about 40-80% faster and SSCA#2 about 17-30% faster. The dynamic load balancing and software pipelining used to optimize SSCA#2 are introduced. Based on the experiments, the potential of Cell/BE for random access is analyzed in detail, as are the limitations of its memory controller, atomic engine and TLB management. Our research shows that, although more programming effort is needed, Cell/BE has potential for irregular memory access applications.

Categories and Subject Descriptors 

D.1.3 [Programming Techniques]: Concurrent Programming - Parallel programming; C.4 [Performance of Systems]: Design studies

General Terms 

programming, performance 

Keywords 

Cell/BE, Random Access



1. INTRODUCTION 

Multi-core and many-core architectures have been widely investigated in recent years. The Cell Broadband Engine (Cell/BE) [13] is a unique multi-core architecture designed by Sony, Toshiba, and IBM (STI). There have been many studies of compute-intensive applications on Cell/BE. Though it primarily targets high-performance multimedia and gaming applications, the Cell/BE has a unique memory architecture compared with conventional multi-core CPUs. Cell/BE has a 204GB/s internal bus and 25.6GB/s of main memory bandwidth. More specifically, Cell/BE allows the program to fully control memory access via explicit DMA operations. In theory, a total of 128 DMA operations may be in flight simultaneously.

At the same time, there is a large collection of applications with random memory access behavior, such as graph exploration [1][14]. Applications of this kind are not well suited to conventional cache-based multi-core processors. In such applications, the data set is much larger than the processor cache and the data access pattern is nearly random, with neither temporal nor spatial locality. The amount of computation is normally small compared with the memory access overhead, which leaves the powerful FPUs of modern processors largely idle.

A common myth about the Cell/BE's memory subsystem is that it is inadequate for irregular data accesses because of the software intervention in the memory access mechanism. Yet this additional overhead (a few instructions) is small compared with the DRAM access latency of a hundred cycles or more. Also, since the Cell/BE enables fine-grained control over data transfer, we can apply multiple techniques to hide the memory access latency.

In this paper, we investigate whether the unique design of the Cell/BE memory system is suited to memory-intensive applications. Previous work has studied certain kernel applications. [16] gave a complete micro-benchmark of the communication network of Cell/BE. [6] implemented list-ranking using software-managed threads. [19] presented a lock-free BFS algorithm that uses the Cell/BE on-chip memory for a bitmap. [12] studied large FFTs on Cell/BE. However, all of these are rather simple kernels, not real-world applications.

Our study is also based on two public benchmarks. One is GUPS [4], which is part of the HPC Challenge benchmark suite; the other is the SSCA#2 benchmark [15][1], which is one of the HPCS Scalable Synthetic Compact Applications. GUPS is a pure, exhaustive random access kernel. Its performance is given in Giga Updates Per Second. We use it to evaluate the capability of the Cell/BE memory system. SSCA#2 is a relatively complex benchmark derived from real-world graph analysis applications, including network analysis, data mining and computational biology. SSCA#2 computes the betweenness centrality of each vertex in a weighted directed graph. The performance metric is Traversed Edges Per Second (TEPS). The algorithm we used was proposed in [7]; it is in fact a BFS flow associated with stateful and coherent data structures.

We have implemented both benchmarks for Cell/BE and evaluated them in detail on IBM QS20 (and QS22) Cell/BE blades. Overall, the results show that Cell/BE is 17%-80% faster than traditional cache-based multi-core SMP systems with the same number of cores/threads and similar memory bandwidth. Our work demonstrates that Cell/BE has the potential to deal with complex memory-intensive applications.

Our main contributions are summarized here: 

• We achieve 0.062 GUPS on the QS20, which is 40-80% higher than on two conventional 16-core/thread multi-core systems.

• We show that the Cell/BE DMA-list mechanism has even more potential for random access. Only 2 of the 16 SPEs are needed to reach 97% of the peak performance.

• We find that the Cell/BE TLB update mechanism affects performance greatly. Performance nearly doubled after adopting a huge-TLB configuration.

• Using dynamic load balancing and a software pipeline mechanism, we achieve 65.8M TEPS for the SSCA#2 benchmark, which is about 17-30% faster than conventional multi-core systems.

• By profiling the SSCA#2 implementation, we find that atomic operations account for most of the delay and keep Cell/BE from achieving an even better result.

The remainder of this paper is organized as follows. Section 2 gives a brief overview of the Cell/BE and QS blade memory system as well as the GUPS and SSCA#2 benchmarks. Section 3 describes our GUPS implementation, with detailed experiments to evaluate the maximum random access performance of Cell/BE. Section 4 presents our techniques for implementing SSCA#2 on Cell/BE. Section 5 gives test and profiling results for the SSCA#2 implementation. Section 6 discusses related work. Section 7 concludes the paper.

2. THE CELL/BE ARCHITECTURE, GUPS 
AND SSCA2 BENCHMARK 

2.1 The architecture of IBM BladeCenter QS20/22 

The Cell Broadband Engine is well known as a heterogeneous multi-core chip [13]. It consists of one traditional general-purpose 64-bit PowerPC core (PPE) and eight 128-bit SIMD coprocessor cores (Synergistic Processor Elements, SPEs). All nine cores are connected via a high-bandwidth bus called the Element Interconnect Bus (EIB) and share coherent main memory. The IBM BladeCenter QS20 and QS22 blades are dual-processor systems based on Cell/BE.

Figure 1: QS20/22 memory architecture

Figure 1 gives an outline of the QS20/22 memory architecture. Each processor has a memory controller with a bandwidth of 25.6GB/s. The two processors are interconnected by a FlexIO interface running the fully coherent Broadband Interface (BIF) protocol. The bandwidth between the two processors is 20GB/s. From the programmer's point of view, the QS blade simply consists of 16 shared-memory SPEs and 2 PPEs.

The main difference between the QS20 and QS22 is the external memory. The QS20 is configured with 1GB of XDR (Rambus) memory, while the QS22 uses up to 32GB of DDR2 SDRAM. In Section 3 we will show that the XDR memory has slightly better random access performance than the DDR2 version.

Each SPE consists of a synergistic processor unit (SPU) and a memory flow controller (MFC). The SPE has no local cache, only a 256KB high-performance local store. The SPU core accesses data only from the local store. All external memory accesses and communication with other cores go through the MFC. The MFC includes a DMA controller, a memory management unit (MMU), and an atomic unit for synchronization.

The MFC DMA controller can queue up to 16 DMA operations at a time. Each operation can be either a single DMA or a scattered DMA-list. The whole system can therefore support more than 250 outstanding memory operations. Each MFC also has an atomic unit that handles atomic operations, but only one reservation is allowed at a time. By default, virtual memory is managed by hardware; each MFC has a 256-entry TLB with a default page size of 4KB.

We will see in Section 3 that the DMA queues can issue more requests than the memory controllers can serve, while the limited TLB page size affects performance greatly.

2.2 The Random Access benchmark (GUPS) 

The Random Access test is part of the HPC Challenge benchmark [2] developed for the HPCS program. The test is intended to exercise the GUPS capability of a system.

GUPS is a measurement that profiles the memory architecture of a system; it is a measure of performance similar to MFLOPS. GUPS is calculated as the number of memory locations that can be randomly updated in one second, divided by 1 billion.

The basic Random Access benchmark definition [4] is: Let r[] be a table of size $2^n$. Let $\{a_i\}$ be a stream of 64-bit integers of length $2^{n+2}$ generated by the primitive polynomial over GF(2), $X^{63} + X^2 + X + 1$. For each $a_i$, set

$r[a_i\langle 63, 64-n\rangle] = r[a_i\langle 63, 64-n\rangle] \oplus a_i$,

where $\oplus$ denotes addition in GF(2) (bitwise exclusive or) and $a_i\langle l, k\rangle$ denotes the sequence of bits within $a_i$, so that $\langle 63, 64-n\rangle$ selects the highest n bits.

The parameter n is defined such that $2^n$ is the largest power of 2 that is less than or equal to half of main memory. The look-ahead and storage before processing on distributed memory multi-processor systems is limited to 1024 per process. A small percentage of errors (not exceeding 1%) is allowed for parallelization.
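The definition above can be sketched as a short sequential loop. This is a minimal illustration only, not the HPC Challenge reference code or our SPU implementation: the function names, the seed value and the use of a Python list for the table are assumptions made for the example, and the shift-register step assumes the usual left-shift formulation in which the low-order terms $X^2 + X + 1$ are folded back in when the top bit is shifted out.

```python
MASK64 = (1 << 64) - 1
POLY = 0x7  # low-order terms X^2 + X + 1 (assumed feedback constant)

def next_random(a):
    """Advance the 64-bit shift-register sequence by one step."""
    high_bit_set = a >> 63
    a = (a << 1) & MASK64
    return a ^ POLY if high_bit_set else a

def random_access(table, n, num_updates, seed=1):
    """XOR-update a table of size 2**n, indexed by the highest n bits of a_i."""
    assert len(table) == 1 << n
    a = seed  # hypothetical starting value, chosen for the example
    for _ in range(num_updates):
        a = next_random(a)
        idx = a >> (64 - n)   # bits <63, 64-n> of a_i
        table[idx] ^= a       # addition in GF(2) is exclusive or
    return table
```

Because exclusive or is its own inverse, replaying the same update stream restores the table, which is one simple way to check an implementation for lost updates.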

GUPS is a good candidate for evaluating the random memory performance of a system. The kernel is too compact to allow much further program optimization. We therefore use GUPS first, as a micro-benchmark tool for our study.

2.3 The HPCS Scalable Synthetic Compact Ap- 
plications graph analysis #2 

The SSCA benchmark suite is part of the DARPA High Productivity Computing Systems (HPCS) program. These benchmarks are intended to complement current scalable micro-benchmarks and complex real applications. SSCA#2 is a graph-theoretic problem that is representative of computations in fields such as social networks, computational biology and data mining.

Our study is based on the SSCA#2 v2.2 [15] specification and the C/OpenMP implementation. SSCA2 contains one scalable graph generator and four computing kernels. The scalable graph generator generates a power-law, scale-free graph for the computing kernels. The computing kernels all require irregular access to the graph's data structure. Since Kernels 1-3 are relatively simple and similar computations are already included in Kernel 4, we focus on Kernel 4 in our research.

Kernel 4 computes the betweenness centrality of all vertices in a weighted directed graph. Consider a graph $G = (V, E)$, where V and E are the sets of vertices and edges respectively.

Let $\sigma_{st}$ denote the number of shortest paths between vertices s and t, and $\sigma_{st}(v)$ the number of those paths passing through v. The betweenness centrality of a vertex v is defined as

$BC(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$    (1)

In the SSCA2 2.2.1 reference implementation, the algorithm follows the method of Brandes [11]. Brandes' algorithm computes the dependency $\delta_s(v)$ using a breadth-first search (BFS) process for each vertex s:

$\delta_s(v) = \sum_{w : v \in pred(s,w)} \frac{\sigma_{sv}}{\sigma_{sw}} \left(1 + \delta_s(w)\right)$    (2)

where $pred(s, v)$ denotes the predecessor set of vertex v on shortest paths from s. Then $BC(v)$ can be obtained by summing $\delta_s(v)$ over all s.

To compute $\delta_s$, a BFS and a back trace process are needed. In the BFS process, besides the access sequence of each vertex, the predecessor sets are also recorded, and the depth $d_s(v)$ and path count $\sigma_{sv}$ are computed throughout the process. For each vertex, the computation of $\sigma_{sv}$ is a multi-source add operation and the computation of the predecessor set $pred(s, v)$ is a multi-source join operation. These two global operations make parallelization more difficult than for the original BFS algorithm. We will see in Section 5 that the atomic operations are the main obstacle to higher efficiency.

The back trace process just uses the results generated during the BFS and computes recursively. This process can be done in parallel without contention. But it still needs to visit all the traversed edges, which means a large number of random memory accesses.

3. ANALYZING THE CELL/BE MEMORY
ENGINE WITH GUPS

Since GUPS is a simple but exhaustive random access kernel, we use it as a tool to evaluate the DMA performance of Cell/BE.

The parallelization of GUPS is straightforward: just split the r[] array equally among the threads. Since Cell/BE does not support threads within an SPU, we use a multi-queue method to implement GUPS. In each SPU, we maintain multiple independent queues. Each queue is assigned a fixed-length DMA-list and loops as follows: fetch a chunk of table entries at random addresses with a DMA-list operation, update them, write them back to main memory, then fetch the next chunk in sequence.

The SPU polls each queue in turn. Once a DMA-list operation has finished, its data is processed immediately; the SPU then starts the following DMA operation and returns to the polling loop.
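The polling scheme can be illustrated with a small simulation. This is a sketch under stated assumptions, not the SPU code: the DMA-list transfer is modeled as a fixed number of polling visits (`latency`), the table index is taken from the low bits of each value for brevity, and all names (`multi_queue_update`, `num_queues`, `list_len`) are hypothetical.

```python
from collections import deque

def multi_queue_update(table, stream, num_queues=4, list_len=8, latency=3):
    """Apply XOR updates by polling several independent queues round-robin.

    Each queue owns one chunk of `list_len` values at a time. A chunk becomes
    ready after `latency` polling visits, which stands in for the time a
    DMA-list gather would take. Returns the number of polling visits.
    """
    mask = len(table) - 1          # table size is assumed to be a power of 2
    chunks = deque(stream[i:i + list_len]
                   for i in range(0, len(stream), list_len))
    slots = [None] * num_queues    # each slot: [visits_remaining, chunk]
    polls = 0
    while chunks or any(s is not None for s in slots):
        for q in range(num_queues):
            polls += 1
            if slots[q] is None:
                if chunks:         # issue the next simulated gather
                    slots[q] = [latency, chunks.popleft()]
                continue
            slots[q][0] -= 1
            if slots[q][0] <= 0:   # transfer finished: update and write back
                for a in slots[q][1]:
                    table[a & mask] ^= a
                slots[q] = None
    return polls
```

With more queues, more simulated transfers are in flight at once, so fewer polling visits are spent waiting per update; this is the effect the queue number parameter is meant to capture.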

Three parameters are considered in the tests: the number of queues within a single SPU, the DMA-list length of each queue, and the number of SPUs.

All results are obtained on the QS20 with IBM Cell/BE SDK 3.1 and Linux 2.6.25, using a 16MB (huge) TLB page size unless otherwise stated. The QS20 has 1GB of memory, so we ran all experiments with a 512MB data size for comparison. It should be noted that a larger data size would decrease the GUPS slightly.

3.1 Single SPU test 

First, we try to determine the best random access performance of a single SPU.

We vary the queue number and queue length. As Figure 2 shows, a single SPU reaches a maximum of 0.0294 GUPS, which is about 47% of the maximum we can ever get from multiple SPUs. Performance improves as the queue number increases. However, there is only a small difference once the queue number is larger than 4; normally a queue number of 8 reaches the maximum. A larger queue length also brings better performance, but the improvement is asymptotic.

Figures 3 and 4 give the GUPS results for 2 and 4 SPUs. The performance with 2 SPUs is nearly doubled; in fact, it already reaches nearly 97% of the maximum. The 4-SPU result shows that the peak is reached easily even with a short queue length of 8-16.

Next we fix the queue number at 4 and vary the queue length and SPU number, as shown in Figure 5. Increasing the number of SPUs beyond 4 does not increase the GUPS, but it does allow a shorter queue length. With 16 SPUs the peak is reached even with queue length 1. The maximum is 0.062 GUPS, which can be reached in many configurations.

From the above results, we conclude that the Cell/BE SPU has great potential for random access. The memory controller is the bottleneck that prevents higher GUPS. We can infer that if the Cell/BE were equipped with more memory channels, the GUPS would readily increase.

3.2 The effect of TLB page size 
Figure 2: Single SPU test, varying queue numbers and queue length



By default the Cell/BE uses a hardware-managed TLB with a page size of 4KB. Each SPU has a 256-entry TLB. So once the data size is larger than 1MB, random access causes frequent TLB misses and reloads, which have a relatively large overhead on Cell/BE. This confused us considerably in the early stage of the work. The effect can be seen in Figures 6 and 7.

From Figure 6, we can see that for one SPU the peak performance is only about 20% of the previous result. A strange phenomenon is that a larger queue length gives an even worse result. In Figure 7, 4 SPUs no longer saturate the bus. With all 16 SPUs the performance reaches only about 56% of the peak of the huge-TLB case. We conclude that the TLB page size has a large influence on applications with random memory accesses.
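The 1MB threshold follows from simple arithmetic on the TLB size, which the following sketch spells out. It is a back-of-the-envelope model, not a measurement: it assumes uniformly random accesses, a warm TLB, and that all 256 entries can hold pages of the configured size; the helper names are made up for the example.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

def tlb_reach(entries, page_size):
    """Bytes of memory that can be mapped by the TLB at one time."""
    return entries * page_size

def expected_hit_fraction(entries, page_size, table_size):
    """Rough fraction of uniformly random accesses that hit a warm TLB."""
    return min(1.0, tlb_reach(entries, page_size) / table_size)

small = expected_hit_fraction(256, 4 * KB, 512 * MB)    # 4KB pages
huge = expected_hit_fraction(256, 16 * MB, 512 * MB)    # 16MB pages
print(f"4KB pages:  reach {tlb_reach(256, 4 * KB) // MB} MB, hit fraction {small:.4f}")
print(f"16MB pages: reach {tlb_reach(256, 16 * MB) // GB} GB, hit fraction {huge:.4f}")
```

Under these assumptions, with 4KB pages only about 0.2% of accesses to a 512MB table would find their translation in the TLB, whereas 16MB pages cover the whole table.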

3.3 Comparison over different platforms 

We compared 4 platforms. The first is the IBM QS20, which has 1GB of XDR RAM; the second is the newer IBM QS22, which has 32GB of DDR2 SDRAM on two dual 128-bit DDR2-800 memory channels. The third platform, 'Opteron', is a quad-processor SMP using the 4-core AMD Opteron 8347. Each core has a 512KB L2 cache and 1K TLB entries, and runs at 1.9GHz. Each processor has a shared 2MB L3 cache. The system has 4 dual-channel DDR2 memory controllers, the same as the QS22 but at a lower 533MHz. The bandwidth between processors is 8GB/s. The last platform, 'Nehalem', is a dual-processor SMP using the latest 4-core Intel Xeon 5530. Each core has a 256KB L2 cache and two hyper-threads, and runs at 2.4GHz. Each processor has an 8MB shared L3 cache. The system has 2 dual-channel 1333MHz DDR3 memory controllers. This platform has a total of 16 hardware threads.

We use the C/OpenMP reference implementation on the x86_64 platforms; the compilers are PGI 7.2 and ICC 11.0. For comparison, we also used huge TLB pages (2MB) on these platforms.

On all platforms, we use the "numactl" [3] utility to make sure the data is spread over all memory channels. We use queue number 4 and queue length 16 for all tests.

The QS20 has about 15% higher GUPS than the QS22. This shows that XDR memory is good at interleaving. The Opteron platform reaches a maximum of 0.033 GUPS, about half that of the QS20. It should be noted that on the Opteron, 16 cores give a worse result than 4-8 cores. This may be due to the limited cross-processor bandwidth of the Opteron. In contrast, even the 4K-page QS20 gets its highest GUPS with 16 SPUs. The Nehalem platform is similar to the Opteron, with a maximum of 0.043 GUPS on 4 threads, about 70% of the QS20.

Figure 3: 2-SPU test, varying queue number and queue length

Figure 4: 4-SPU test, varying queue number and queue length

Figure 5: Fixed queue number = 4, varying queue length and SPU number

Figure 6: GUPS with 4K page, 1 SPU, varying queue number and queue length

The effect of TLB page size on Opteron and Nehalem is not as large as on Cell/BE, because they have more powerful TLB mechanisms.

Overall, at least for GUPS, Cell/BE is a better platform than conventional multi-core platforms.

4. THE IMPLEMENTATION OF SSCA#2 OVER 
CELL/BE 

The pseudocode of SSCA2 kernel 4 v2.2 is as follows [15]:

Input: G(V, E)   /* |V| = 2^SCALE */
Output: Array BC[1...n]

 1  for all v ∈ V in parallel do
 2      BC[v] ← 0;
 3  let V_S ⊆ V and |V_S| = 2^K4approx   /* exact vs. approximate */
 4  for all s ∈ V_S in parallel do
 5      S ← empty stack;
 6      P[w] ← empty list, w ∈ V;
 7      σ[t] ← 0, t ∈ V; σ[s] ← 1;
 8      d[t] ← -1, t ∈ V; d[s] ← 0;
 9      queue Q ← s;
10      while Q ≠ ∅ do
11          dequeue v ← Q;
12          push v → S;
13          for each neighbor w of v in parallel do
14              if d[w] < 0 then
15                  enqueue w → Q;
16                  d[w] ← d[v] + 1;
17              if d[w] = d[v] + 1 then
18                  σ[w] ← σ[w] + σ[v];
19                  append v → P[w];
20      δ[v] ← 0, v ∈ V;
21      while S ≠ ∅ do
22          pop w ← S;
23          for v ∈ P[w] do
24              δ[v] ← δ[v] + (σ[v] / σ[w]) (1 + δ[w]);
25          if w ≠ s then
26              BC[w] ← BC[w] + δ[w];

Figure 8: GUPS of different platforms, varying core/thread number (HT: huge TLB, ST: small TLB)



Loop 10-19 is the BFS expansion process; loop 21-26 is 
the back trace process. 
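For readers who want to trace the two loops, a sequential rendering of the same steps is given below. It is a minimal sketch, not our SPU code or the C/OpenMP reference implementation: it ignores edge weights, runs on one thread, and represents the graph as a list of neighbor lists, all of which are simplifications made for the example.

```python
from collections import deque

def betweenness(adj, sources=None):
    """Brandes-style betweenness centrality on an unweighted directed graph.

    adj[v] is the list of neighbors of vertex v. `sources` selects the start
    vertices (all vertices if None), mirroring the exact/approximate choice.
    """
    n = len(adj)
    bc = [0.0] * n
    for s in (range(n) if sources is None else sources):
        stack = []                      # S: vertices in visiting order
        pred = [[] for _ in range(n)]   # P[w]: predecessors of w
        sigma = [0] * n                 # number of shortest paths from s
        sigma[s] = 1
        dist = [-1] * n                 # d[]: BFS depth, -1 means unvisited
        dist[s] = 0
        queue = deque([s])
        while queue:                    # BFS expansion (steps 10-19)
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    queue.append(w)
                    dist[w] = dist[v] + 1
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        delta = [0.0] * n
        while stack:                    # back trace (steps 21-26)
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

For a three-vertex path with edges in both directions, `betweenness([[1], [0, 2], [1]])` gives a score of 2 to the middle vertex, because it lies on the only shortest path for both ordered pairs of end vertices.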

Our implementation uses nearly the same process flow and data structures as the C/OpenMP version. We distribute the workload at steps 11 and 22. The dynamic queue Q is divided evenly among all SPUs, and each SPU then processes its part of Q. Steps 16 and 18 are global update operations that need atomic operations to ensure consistency. Using the atomic instructions of Cell/BE, the two updates can be done in a single 128-byte getllar, check-and-update, putllc sequence. The initial port is straightforward.

To get better performance, three techniques were used that exploit the features of Cell/BE:

4.1 Dynamic load balancing 

The workload is distributed at steps 11 and 22. We take step 11 as an example. Although the workload is divided evenly according to Q, the real workload depends on the total number of neighbors of each vertex in steps 13-19 as well as on the topology of the graph. These cannot be known before the work is partitioned. In fact, the scale-free nature of the graph increases the workload imbalance: the number of neighbors of a vertex varies from zero to thousands. So we adopt a dynamic load balancing mechanism: each SPU allocates only a small number of vertices from Q at a time, and allocates more only after finishing its current work. Since allocation also needs synchronization, allocating vertices one by one is not acceptable. In our experiments, this mechanism improved performance by at least 15%.
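The allocation scheme can be sketched with ordinary threads standing in for SPUs. This is an illustration under assumptions, not the Cell/BE implementation: a `threading.Lock` replaces the atomic reservation used for synchronization, and the names `process_dynamic`, `work` and `chunk` are hypothetical.

```python
import threading

def process_dynamic(items, work, num_workers=4, chunk=8):
    """Process `items` with workers that claim small chunks on demand."""
    next_index = 0
    lock = threading.Lock()
    results = [None] * len(items)

    def worker():
        nonlocal next_index
        while True:
            with lock:                        # one synchronization per chunk
                start = next_index
                next_index = min(start + chunk, len(items))
                end = next_index
            if start >= len(items):           # queue exhausted
                return
            for i in range(start, end):       # no synchronization in here
                results[i] = work(items[i])

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The `chunk` parameter expresses the trade-off described above: a chunk of one balances best but synchronizes on every vertex, while a very large chunk degenerates into the static partition.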

4.2 Prefetching using clustered DMA and DMA-list

The Cell/BE SPU has no local cache and no hardware prefetching mechanism. Clustering of data accesses has to be done by hand. However, the program has many steps with data dependences. For example, steps 12-19 can be split into the following steps:

1) Load v from Q

2) Load one neighbor w of v, load the weight of edge (v, w)

3) Check w and weight

4) Load d[w], load σ[w]

5) Check d[w], update σ[w]

6) Append w to Q, append v to P[w]

Each step depends on data or a condition from the previous step. If single-word DMA operations were used, most of the time would be wasted waiting for the last DMA to complete. So prefetch and post-write buffers are used for each kind of data. Because the data variables have dynamic sizes, this increases the programming complexity considerably. In 4), a DMA-list must be used since the vertices w are scattered across the graph.

The atomic update in 5) prevents batched DMA from being used. We have to perform atomic updates one by one to ensure consistency. This remains the main source of delay in the whole program.



4.3 Software pipelines 

Even with clustered DMA and DMA-lists, much time is still wasted waiting for memory I/O. Sometimes a vertex has only 1-2 neighbors, which makes clustering impossible. So we designed a 3-stage software pipeline for steps 13 to 19 to reduce the latency further:

Stage 1) Load index (w and weight(w))

Stage 2) Load scattered data (d[w], σ[w])

Stage 3) Check d[w], do atomic update and post-write

In one loop iteration or time step, stage 1 starts loading the neighbors of $v_{n+2}$, stage 2 starts loading σ[w] for the neighbors of $v_{n+1}$, while stage 3 is updating σ[w] for the neighbors of $v_n$. Triple buffers are used for the three stages. With the software pipeline, there is no need to wait immediately for completion of any normal DMA operation. This allows more overlap among the various DMA operations and so makes better use of the DMA capability of the MFC.

The scale-free graph adds complexity here again. Since some vertices may have thousands of neighbors, their work has to be spread over multiple stages; for vertices with zero neighbors, an empty stage is inserted. So in the end we have an irregular software pipeline with dependencies between stages.

The software pipeline works smoothly and adds at least another 15% to performance. But profiling shows that stage 3 takes the most time, because the atomic operations cause stop-and-wait.
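The scheduling idea behind the three stages can be sketched in a few lines. This is a regular pipeline only, a simplification of the irregular one described above: the stage functions are passed in as plain callables, transfers are treated as instantaneous, and the names `software_pipeline`, `load_index`, `load_data` and `update` are hypothetical.

```python
def software_pipeline(items, load_index, load_data, update):
    """Run three stages so that in each step they work on different items.

    Returns the schedule as a list of (stage1_item, stage2_item, stage3_item)
    tuples, one per time step; None marks an idle stage.
    """
    buf12 = None   # (item, index) handed from stage 1 to stage 2
    buf23 = None   # (item, data) handed from stage 2 to stage 3
    schedule = []
    for step in range(len(items) + 2):    # two extra steps drain the pipeline
        s1 = s2 = s3 = None
        if buf23 is not None:             # stage 3: check and update
            s3 = buf23[0]
            update(*buf23)
        new23 = None
        if buf12 is not None:             # stage 2: gather scattered data
            s2 = buf12[0]
            new23 = (s2, load_data(buf12[1]))
        new12 = None
        if step < len(items):             # stage 1: load the neighbor index
            s1 = items[step]
            new12 = (s1, load_index(s1))
        buf12, buf23 = new12, new23
        schedule.append((s1, s2, s3))
    return schedule
```

In steady state each time step touches three different items, which is what allows the loads of stages 1 and 2 to be overlapped with the stop-and-wait update of stage 3.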

For steps 22-24, we use another, similar software pipeline.

In summary, porting SSCA2 is not an easy task. Not only is the algorithm itself relatively complex, but the varying workload and data structure sizes also make it harder to achieve good performance.

5. PERFORMANCE EVALUATION OF SSCA#2 

5.1 SSCA2 behavior on QS20 

We use SSCA#2 Kernel 4 with scale 18-22 and K4Approx=8. For scale 22, more than half of the memory on the QS22 was used. Figure 9 shows the run time for varying core counts and scales; all axes are in log scale. For a fixed core count we see a nearly straight line, which means the performance does not change much with problem size. We also see near-linear speedup as more SPUs are added; 16 SPUs reach the peak, about 65.8 MTEPS. Compared with the GUPS result above, where only 4 SPUs use up the memory path, we can infer that our implementation has not fully utilized the DMA power of a single SPU. The reason may be the idle delay caused by atomic updates.

5.2 The internal profiling result 

Using the built-in decrementer of the Cell/BE SPU, we analyzed the inner loop of the SSCA2 code. Normally the ratio of processing time between the BFS and the back trace process is about 3.45:1.

Figure 9: SSCA#2 on QS20 for varying scale and

Since a software pipeline was used, all normal DMA operations are asynchronous, and it is difficult to tell the exact execution time of each DMA. The exception is the atomic update, for which stop-and-wait must be used. The time ratio of the three stages is about 1:3:25, and within stage 3 the atomic update accounts for about 80% of the time. On average each atomic update operation takes about 630ns on the QS20 and 550ns on the QS22. It should be mentioned that background DMA operations are still in progress while an atomic update is being executed, so the portion of pure delay caused by atomic operations is not yet determined.
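As a rough worked estimate from the ratios above, the share of pipeline time spent in atomic updates can be bounded as follows. This is arithmetic on the reported approximate ratios only, and it is an upper bound on the pure delay, since background DMA overlaps with the atomic updates.

```latex
\[
\frac{t_{\text{stage 3}}}{t_{\text{pipeline}}} \approx \frac{25}{1 + 3 + 25} = \frac{25}{29} \approx 0.86,
\qquad
\frac{t_{\text{atomic}}}{t_{\text{pipeline}}} \approx 0.80 \times 0.86 \approx 0.69
\]
```

So, under these figures, roughly two thirds of the pipelined loop time coincides with atomic updates.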

5.3 Comparison over different platforms 

We use scale=22 and K4approx=8, and vary the core/thread count on each platform. A Sun T2 5220 (Niagara 2) was added for this test. It has a 1GHz processor with 8 cores and 64 hardware threads, and 4 dual-channel FB-DIMM controllers with nearly 60GB/s of memory bandwidth. A specially optimized OpenMP version was used.

Figure 10: SSCA#2 K4, Scale=22, varying core/thread number (Niagara should multiply by 4)

The best result among these platforms is the Sun Niagara 2, at about 70.4 MTEPS. The QS20 has a maximum of 65.8 MTEPS, about 10% faster than the QS22, 17% faster than Nehalem and about 30% faster than the Opteron platform, which has the same core/thread count and similar memory bandwidth.

For all platforms, performance keeps improving as the core/thread count increases. This suggests that the memory bandwidth is not fully utilized, because such a complex application needs more processing logic.

Optimization of SSCA#2 for a highly multithreaded architecture, e.g. Sun Niagara 2, is much more straightforward. But this work indicates that, after applying multiple techniques to hide the memory access latency, the performance of Cell/BE is comparable to Niagara 2 and better than conventional multi-core platforms.

6. RELATED WORKS

Little of the literature deals with memory-intensive applications on Cell/BE. Papers [12][5] studied FFT on Cell, which has a scattered but regular memory access pattern. FFT does have features similar to GUPS and SSCA2. From this work we got valuable hints, including huge TLB pages and fast SPU synchronization.

[16] gave a detailed analysis of the communication performance of Cell/BE using micro-benchmarks, which encouraged our work on GUPS. They focused on bus performance and did not give results for large data sets.

[6] presented a software thread idea for list-ranking, which inspired our GUPS implementation. For SSCA2, the irregular data sizes make thread partitioning troublesome, so we eventually used a software pipeline method instead.

[19] designed a delicate lock-free BFS algorithm on Cell/BE. The algorithm depends on a bitmap in the SPU's on-chip memory. During the optimization of SSCA2 on Cell/BE, we found that the main obstacle was the global atomic update. Each atomic operation pauses the pipeline with idle waiting. However, it is not easy to design a lock-free algorithm because of the number of globally random data updates. SSCA2 needs d[w], σ[w] and the predecessor set to be updated at the same time during the BFS expansion. These data structures are too large to fit in the on-chip memory.

In [18] SSCA2 was ported to an innovative many-core platform, which splits the cores between memory operations and graph analysis.

[9] discussed how the architectural features of the Cray MTA-2 support graph analysis applications, including list-ranking and connected components.

[H^ gave an implementation of BFS on the Cray XMT using its unique synchronization feature.

[17] presented a lock-free algorithm for SSCA2 K4 on multi-core x86 platforms based on a partitioned data structure. Whether it is also effective on the Cell/BE platform still needs to be checked.

Our implementation of SSCA2 is based on [15][1] and includes the latest changes from v2.2.1. We use nearly the same in-memory data structures and flow for comparison.


7. CONCLUSIONS 

In this paper, we investigated two memory-intensive benchmarks, GUPS and SSCA2, on the Cell Broadband Engine platform. We find that both benchmarks perform well on the IBM QS20/22. Compared with two conventional multi-core systems with similar memory bandwidth, GUPS is about 40-80% faster and SSCA2 about 17-30% faster. By using dynamic load balancing and software pipelining in SSCA2, we showed that a relatively complex graph analysis application can be ported to the Cell/BE platform and achieve better performance than on conventional multi-core platforms.

Our work shows that the Cell/BE SPU DMA engine has the capability for more random accesses than the memory controller allows; that the TLB page size greatly affects random access performance on large datasets; and that overall memory access performance degrades if there are large numbers of atomic operations.

It remains an open problem whether there is an efficient lock-free algorithm for SSCA2 that can exploit more of the memory access capability of the Cell/BE platform.